Abstract


This paper presents two new formal frameworks for learning. The first framework requires the
learner to approximate an unknown function, given examples for the function as well as some background
information on it. It is shown that this framework is no more powerful than a framework that allows the
learner to see examples but not background information. The second framework explores learning in the
sense of improving computational efficiency as opposed to acquiring an unknown concept or function.
Specifically, the framework concerns the acquisition of heuristics from examples over problem domains
of special structure. A theorem is proved identifying some conditions sufficient to allow the efficient
acquisition of heuristics over the aforementioned class of domains.
1. Introduction

This paper concerns learning algorithms: algorithms that construct good approximations to
unknown functions from examples for those functions. The recent interest in formal methods in machine
learning started with the introduction of a formal framework for concept learning in [Valiant 1984]. Since
then, the framework has been extended and analyzed by numerous authors [Blumer et al. 1986,
Natarajan 1987a, Kearns et al. 1987]. Unfortunately, the framework appears rather limited in scope and
does not seem to capture the essence of many of the learning paradigms and architectures in use by the
experimentalists. Since one of the important goals of theoretical research in machine learning is to
develop a general framework for the problem, it is necessary to formulate and analyze alternative
frameworks that capture the behaviour of learning models popular among workers in Artificial Intelligence.
With the above in mind, this paper presents two new frameworks for learning: (a) a learning framework
that captures the essential ingredients of what is called a "learning architecture" in the AI literature, and (b) a
learning framework for the acquisition of heuristic rules as a means of improving computational efficiency.
The former is a framework that provides the learning algorithm with randomly chosen examples of the
function to be learned and some background information or "theory" about the function to be learned. It is
a widely held intuition among workers in Artificial Intelligence that such a framework is strictly more
powerful than the one of [Valiant 1984]. The latter is a framework specifically aimed at algorithms that
construct heuristics in problem-solving domains such as symbolic integration. In analysing these two
frameworks, we prove two theorems, one on each framework.

We begin by extending the results of [Blumer et al. 1986, Natarajan 1987] on the learnability of
boolean-valued functions to the learnability of general functions. To do so, we give a new and simple
definition of the dimension of a family of functions and use it to prove a theorem identifying the most
general class of function families that are learnable from polynomially many examples. Our results hold
only for discrete domains. For continuous domains, we show how the results of [Blumer et al. 1986] for
boolean-valued functions may be modified to include general-valued functions. We also establish that our
notion of dimension is equivalent to the more complicated Vapnik-Chervonenkis dimension of [Blumer et
al. 1986, Vapnik and Chervonenkis 1971]. The theorem of this section will be heavily used in the following
sections and is the first of our results.

We then propose a new framework for learning, one that attempts to capture the essential
ingredients of the "general learning architectures" of the experimentalists [Laird et al. 1986, Mitchell et al.
1986]. This is a major contribution of the paper. Specifically, the framework requires the learning
algorithm to learn a function from examples for the function. The examples are picked at random by the
teacher. In addition, the teacher provides the learner with some "theory" relevant to the concept to be
learned, with the understanding that the concept to be learned is consistent with the "theory" presented.
For instance, when teaching a concept in geometry, the teacher may present the learner with some basic
theorems in geometry in addition to examples for the concept to be learned, in the hope that this would
accelerate the learning process. Our main result here is a theorem stating that the class of function
families learnable in this framework (i.e., from few examples and short theories) is exactly the class of
families learnable in the framework of [Valiant 1984] (i.e., from few examples and no theories). As it
happens, the proof of this theorem is remarkably simple, owing to the intuitive strength of the new notion
of dimension introduced in this paper. Yet, the theorem has some unintuitive consequences. Firstly, it
directly implies that learning from background information and examples is no more powerful than
learning from examples alone. This contradicts the beliefs prevalent in the Artificial Intelligence
community. Secondly, and more subtly, the theorem leads to the realization that although background
information cannot reduce the information complexity of learning, it could reduce the computational
complexity of processing the information obtained from examples. This opens up a rich new area of
theoretically interesting problems, one of which is stated in this paper but left open.

Finally, we develop a learning framework that explores learning as a means of improving
computational efficiency rather than learning new concepts. Consider the problem of learning symbolic
integration. Theoretically speaking, given a table of integrals the student should become an expert
instantly. However, the student appears to need some sample problems and solutions before he
develops any facility with integrals. Our framework attempts to capture the flavour of the above. Define a
problem domain D on an alphabet Σ to be the pair (G, O) where G is the goal function (a boolean-valued
function on Σ*), and O is the set of operators (length-preserving functions on Σ*). These notions will be
made precise later. An algorithm for D would take an input string x and transform it using the operators in
O so that the transformed string satisfies G, if such is possible. A meta-domain M is simply a set of
domains, and a meta-algorithm for M is an algorithm that takes as input the specification of a domain D ∈
M, calls for a small number of randomly selected examples for D, and produces as output an efficient
algorithm for the domain D. To illustrate the power of the framework, we exhibit a set of domains each of
which possesses a simple polynomial-time algorithm. We show that although the task of computing an
efficient algorithm for a domain from its specification is NP-complete for this set of domains, the task is
quite tractable within our framework. We then prove a theorem identifying some conditions sufficient to
allow the existence of a meta-algorithm within the proposed framework. To our knowledge, this is the first
formalization where examples provide no new information to the learner, and serve only to improve the
computational complexity of processing the information already possessed by the learner.

2.  Preliminaries 

We now describe our version of the learning framework proposed by [Valiant 1984]. We will call this
Framework 1, to distinguish it from those that follow. Without loss of generality, let Σ be the binary
alphabet and Σ* the set of all binary strings. We consider functions from Σ* to Σ*. An example of a
function f is a pair (x, f(x)). A learning algorithm is an algorithm that attempts to infer a function from
examples for it. The learning algorithm has at its disposal a routine EXAMPLE, that at each call produces
an example for the function to be learned. The probability that a particular example (x, f(x)) will be produced
by a call of EXAMPLE is P(x), as given by the probability distribution P. Also, the probability that the
learned function will be queried on a particular string x is P(x). The distribution P can be arbitrary and
unknown.


We define a family of functions F to be any set of length-preserving functions from Σ* to Σ*. The
n-th subfamily F_n of a family F is the family of functions induced by F on Σ^n. Specifically, if
F = {f_1, f_2, ...}, then F_n = {g_1, g_2, ...}, where g_i is defined as follows.

$$g_i(x) = \begin{cases} f_i(x) & \text{if } |x| = n \\ \text{undefined} & \text{otherwise} \end{cases}$$

A basis for F_n is a subset D_n of F such that for each g ∈ F_n, there is exactly one function f ∈ D_n such
that f and g agree on Σ^n.

Following  [Valiant  1984],  we  say  that  a  family  of  functions  is  learnable  if  there  exists  a  uniformly 
convergent  learning  algorithm  for  it.  Specifically,  a  family  of  functions  F  is  learnable  if  there  exists  a 
learning  algorithm  that 

(a) takes as input integers n and h.

(b) makes polynomially many calls of EXAMPLE, polynomial both in the adjustable error parameter h and in the
problem size n. EXAMPLE produces examples of some function in F_n.

(c) For all functions f in F_n and all probability distributions P on Σ^n, with probability (1 - 1/h) the
algorithm outputs a function g in F such that

$$\sum_{x \in S} P(x) \le 1/h$$

where

$$S = \{x : |x| = n \text{ and } f(x) \ne g(x)\}.$$

We assume that the learning algorithm's output is the index of the learned function in some
acceptable indexing of the functions in family F. Furthermore, if the learning algorithm runs in time
polynomial in n and h, we say that the family is polynomial-time learnable.
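Condition (c) of the definition can be made concrete with a small sketch. The names `error` and `within_tolerance`, the toy functions, and the uniform distribution are illustrative assumptions, not part of the framework itself, which allows P to be arbitrary and unknown.

```python
from fractions import Fraction

def error(f, g, P):
    """Total probability, under P, of the strings on which f and g disagree."""
    return sum(p for x, p in P.items() if f(x) != g(x))

def within_tolerance(f, g, P, h):
    """Condition (c): the disagreement set S has probability at most 1/h."""
    return error(f, g, P) <= Fraction(1, h)

# Hypothetical toy instance with n = 2 and a uniform distribution P.
P = {x: Fraction(1, 4) for x in ("00", "01", "10", "11")}
f = lambda x: x                            # target: identity
g = lambda x: "00" if x == "11" else x     # hypothesis: wrong only on "11"

print(error(f, g, P))                # 1/4
print(within_tolerance(f, g, P, 2))  # True
print(within_tolerance(f, g, P, 8))  # False
```

Exact fractions are used so that the comparison with 1/h is not affected by floating-point rounding.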

The dimension of a subfamily F_n, denoted by dim(F_n), is given by

$$\dim(F_n) = \log(|F_n|)/(2n).$$

A family F is of dimension D(n) if for all n, dim(F_n) ≤ D(n). If D(n) is polynomial in n, we say that F is of
polynomial dimension.
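As a rough illustration of this definition, the following sketch computes the dimension of a small, hypothetical subfamily. It assumes the logarithm is taken base 2 (natural for the binary alphabet, but not stated explicitly in the text), and the XOR-mask family is invented purely for the example.

```python
import math
from itertools import product

def dimension(subfamily, n):
    """dim(F_n) = log2(|F_n|) / (2n) for a finite n-th subfamily (base 2 is an assumption)."""
    return math.log2(len(subfamily)) / (2 * n)

# Hypothetical toy family on binary strings of length n = 2:
# each function XORs its input with a fixed mask, so |F_2| = 4.
n = 2
strings = ["".join(bits) for bits in product("01", repeat=n)]

def xor_table(mask):
    """Table of the function x -> x XOR mask on all strings of length n."""
    return {x: "".join(str(int(a) ^ int(b)) for a, b in zip(x, mask)) for x in strings}

# Represent each function by its table on strings of length n; identical tables collapse.
F_n = {tuple(sorted(xor_table(m).items())) for m in strings}

print(len(F_n))           # 4
print(dimension(F_n, n))  # 0.5
```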

For any set of examples S, define the set Π_F(S) as the set of all subsets of S obtained by intersecting
S with the functions in F, i.e.,

$$\Pi_F(S) = \{R : R \subseteq S \text{ and } \exists f \in F \text{ such that } f \text{ agrees with } S \text{ on } R \text{ and disagrees with } S \text{ on } S - R\}.$$

If Π_F(S) = 2^S, we say that F shatters S.
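The definition of shattering can be checked by brute force on small instances. In the sketch below, each function contributes exactly one subset of S (the examples it agrees with), so the collection of these subsets is the set defined above. The helper names and the four one-bit functions are hypothetical, and S is assumed to contain distinct examples.

```python
def pi_F(family, S):
    """Subsets R of S such that some f in the family agrees with S exactly on R."""
    result = set()
    for f in family:
        # f agrees with S on R and disagrees on S - R, where R is f's agreement set.
        result.add(frozenset((x, y) for (x, y) in S if f(x) == y))
    return result

def shatters(family, S):
    """F shatters S when every subset of S arises as an agreement set."""
    return len(pi_F(family, S)) == 2 ** len(S)

# Hypothetical family of functions on one-bit strings.
identity = lambda x: x
negate = lambda x: "1" if x == "0" else "0"
const0 = lambda x: "0"
const1 = lambda x: "1"

S = [("0", "0"), ("1", "1")]
print(shatters([identity, negate, const0, const1], S))  # True
print(shatters([identity, negate], S))                  # False
```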


Lemma 1: If F_n is of dimension d, then there exists a set of d examples that is shattered by F_n.
Proof: Omitted for brevity. Please see [Natarajan 1987b]. •

Theorem 1: A family of functions is learnable if and only if it is of polynomial dimension.
Proof: Omitted for brevity. Uses Lemma 1. Please see [Natarajan 1987b]. •


As our results above are based on information-theoretic methods, it is difficult to extend them directly
to continuous spaces where each example can be of infinite length. On the other hand, the results of
[Blumer et al. 1986] for learning boolean-valued functions are obtained using some classical results in
probability theory and are valid over continuous domains. In the following, we show how to extend their
results to general functions.

As in [Blumer et al. 1986], we define the Vapnik-Chervonenkis dimension d_vc(F) of a family F as
follows. d_vc(F) is the smallest integer d such that no set of cardinality d+1 is shattered by F.

Since we no longer need the notion of a subfamily, we modify our definition of learnability
accordingly. In particular, a family of functions F is learnable if there exists an algorithm that

(a) takes as input an integer h.

(b) makes polynomially many calls of EXAMPLE, polynomial in the adjustable error parameter h.

(c) as in the earlier definition of learnability.

With these definitions in hand, we can state the following theorem.

Theorem 2: For any finite alphabet Σ, a family of functions from Σ* to Σ* is learnable if and only if it is
of finite Vapnik-Chervonenkis dimension.

Proof: The proof of this theorem is similar to the proof of the corresponding theorem for boolean-valued
functions [Blumer et al. 1986]. •

To  establish  the  relationship  between  the  two  measures  of  dimension,  we  have  the  following. 

Theorem 3: For any family F,

$$\dim(F_n) \le d_{vc}(F_n) \le (2n)\,\dim(F_n).$$

Proof: Omitted for brevity. Please see [Natarajan 1987b]. •

Lastly, we give a result that attempts to introduce computational complexity into Theorem 1. Define
an ordering of a family of functions F to be an algorithm that

(a) takes as input an integer n and a set S = {e_1, e_2, ...} of examples such that each e_i is a pair of
strings of length n.

(b) produces as output a function f ∈ F that is consistent with S, if such exists, i.e., (x, y) ∈ S implies
y = f(x).

Furthermore, if the ordering runs in time polynomial in the length of its input, we say it is a
polynomial-time ordering and F is polynomial-time orderable.

Theorem  4:  A  family  of  functions  is  polynomial-time  learnable  if  it  is  of  polynomial  dimension  and  is 
polynomial-time  orderable. 


Proof: Follows from that of Theorem 1. •
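The idea behind Theorem 4 (draw a sample, then hand it to an ordering) can be sketched as follows. This is only an illustration under stated assumptions: the sample size m is passed in as a parameter rather than derived from n, h and the dimension, the XOR-mask family and its ordering are hypothetical, and examples are drawn uniformly although the framework permits any distribution.

```python
import random

def xor(a, b):
    """Bitwise XOR of two equal-length binary strings."""
    return "".join("1" if p != q else "0" for p, q in zip(a, b))

def xor_ordering(n, S):
    """Ordering for the toy family {x -> x XOR mask}: a consistent mask, or None."""
    if not S:
        return "0" * n
    x0, y0 = S[0]
    mask = xor(x0, y0)  # the only mask consistent with the first example
    return mask if all(xor(x, mask) == y for x, y in S) else None

def learn(example, ordering, n, m):
    """Draw m examples, then return any function consistent with all of them."""
    S = [example() for _ in range(m)]
    return ordering(n, S)

# Hypothetical EXAMPLE routine for the target mask "101" (uniform distribution assumed).
rng = random.Random(0)
target_mask = "101"

def example():
    x = "".join(rng.choice("01") for _ in range(3))
    return (x, xor(x, target_mask))

print(learn(example, xor_ordering, n=3, m=5))  # 101
```

The toy ordering runs in time linear in the size of S, so it is a polynomial-time ordering for this particular family.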
3. Learning Architectures

Workers  in  Artificial  Intelligence  have  long  sought  to  build  general-purpose  learning  programs  that 
may  be  used  over  many  domains.  Specifically,  such  programs  or  "architectures"  take  as  input  a 
description  of  the  family  of  functions  to  be  learned  and  after  some  precomputation  behave  like  learning 
algorithms for that family. We will refer to such algorithms as "learning architectures".


Consider a learning architecture M that works over a set of families G_1, G_2, ..., G_i, ..., i.e., M takes as
input the description of some G_i and then behaves as a learning algorithm for G_i. Now, if G = G_1 ∪ G_2 ∪ ...
is itself a family of low dimension, then it follows from Theorem 1 that we can build a learning
algorithm for G and not bother with the complications of M. The interesting question is whether it is
possible for G to be of intractably high dimension and yet be decomposable into G_1, G_2, ... such that each
G_i is of low dimension and each G_i has a short description that can be fed into the learning architecture.
In order to answer this question, we consider the framework of the following section.

3.1  Learning  from  Examples  and  Background  Information 

We  now  present  a  learning  framework  that  allows  the  learning  algorithm  to  see  examples  for  the 
function  to  be  learned  as  well  as  some  background  information.  We  will  call  this  Framework  2. 

Let F be a family of functions. A theory for F is simply any total function from F to Σ*.


A learning algorithm for F is an algorithm that attempts to infer functions in F from examples and
background information. The learning algorithm has at its disposal a routine TEACHER, that is best
described as the pair <EXAMPLE, T>, where EXAMPLE is the source of random examples described in
Framework 1 and T is a theory for F. Suppose TEACHER is attempting to teach the learning algorithm some
function g ∈ F_n. On the first call of TEACHER, TEACHER returns t = T(f), where f is any function in F that
agrees with g on Σ^n. On subsequent calls, TEACHER returns a randomly chosen example for g by invoking
EXAMPLE in turn.
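The calling convention of TEACHER can be sketched as a closure: the first call yields the theory string, and every later call yields a random example. The function names, the uniform choice of x, and the constant theory below are assumptions made for illustration; the sketch also passes the target function directly rather than selecting a representative that agrees with it on strings of length n.

```python
import random

def make_teacher(f, theory, n, rng):
    """TEACHER = <EXAMPLE, T>: first call returns T(f), later calls return examples of f."""
    first_call = True

    def teacher():
        nonlocal first_call
        if first_call:
            first_call = False
            return theory(f)
        # EXAMPLE: a random string of length n paired with its image under f.
        x = "".join(rng.choice("01") for _ in range(n))
        return (x, f(x))

    return teacher

def flip_all(x):
    """Hypothetical target function: complement every bit."""
    return "".join("1" if c == "0" else "0" for c in x)

# Hypothetical theory assigning the string "1" to every function.
teacher = make_teacher(flip_all, lambda f: "1", n=3, rng=random.Random(0))
hint = teacher()   # the theory for the target
x, y = teacher()   # a random example (x, f(x))
print(hint, len(x), y == flip_all(x))
```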

We say that a family of functions F is learnable in Framework 2 if there exists a learning algorithm A
and a theory T for F such that

(a) A takes as input integers n, h.

(b) A makes polynomially many calls of TEACHER = <EXAMPLE, T>, polynomial in n and h.
TEACHER should return a theory of length polynomial in n.

(c) For all functions f in F_n, and all probability distributions P over the examples for f, the algorithm
deduces with probability ≥ 1 - 1/h a function g in F such that

$$\sum_{x \in S} P(x) \le 1/h$$

where

$$S = \{x : |x| = n \text{ and } f(x) \ne g(x)\}.$$

Furthermore, if the learning algorithm runs in time polynomial in n and h, we say that F is
polynomial-time learnable in Framework 2.


Abusing notation, we extend the theory function T to subsets of F: for F' ⊆ F,

$$T(F') = \{T(f) : f \in F'\}.$$

Also, we define the inverse of a theory T to be the function T^{-1} from Σ* to subsets of F as given below.
For t ∈ Σ*, T^{-1}(t) = {f | f ∈ F, T(f) = t}.

We are now ready to state our main result.

Theorem 5: A family of functions F is learnable in Framework 2 if and only if

(a) F is of polynomial dimension.

(b) F is learnable in Framework 1.

Proof: (Part (a)) (if) By Theorem 1, if F is of polynomial dimension, then F is learnable in Framework
1. Hence it is learnable in Framework 2, as Framework 1 is but a special case of Framework 2.

(only if) Let A be a learning algorithm for F in Framework 2 using a TEACHER = <EXAMPLE, T> for
some theory T for F. For any n, let T_n be the set of theories offered by TEACHER over all the functions in
F_n. Surely T_n = T(B_n) for some basis B_n for F_n. Whatever the interpretation of the theories used by A, the
set of functions A considers consistent with a theory t ∈ T_n contains the set T^{-1}(t) ∩ B_n. Hence, if
A requires polynomially many examples after seeing t, then by Theorem 1, T^{-1}(t) ∩ B_n must be of
dimension bounded by a polynomial in n. Also, the length of the theories must be bounded by a
polynomial in n, as A is a learning algorithm for F in Framework 2. From these two bounds and the
following claim, we conclude that F is of polynomial dimension.

Claim 1: Let F_n be the n-th subfamily of a family F, B_n a basis for F_n, and T any theory for F. Let A
be a learning algorithm for F with TEACHER = <EXAMPLE, T>. Then there exists a theory t ∈ T(B_n)
such that

$$2n \dim(T^{-1}(t) \cap B_n) + \mathrm{length}(t) \ge n \dim(F_n).$$

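Proof:

Let T_n = T(B_n). Since for all f ∈ F, f ∈ T^{-1}(T(f)), we have

$$B_n = \bigcup_{t \in T_n} \left( T^{-1}(t) \cap B_n \right).$$

Now

$$|F_n| = |B_n| = \Bigl| \bigcup_{t \in T_n} \left( T^{-1}(t) \cap B_n \right) \Bigr|.$$

Let

$$l = \max \{ \mathrm{length}(t) : t \in T_n \} \quad \text{and} \quad d = \max \{ \dim(T^{-1}(t) \cap B_n) : t \in T_n \}.$$

Hence

$$|F_n| \le 2^{l} \, 2^{2nd},$$

and hence

$$2n \dim(F_n) \le l + 2nd.$$

At least one of l and 2nd must therefore be at least n dim(F_n), and each is attained by some t ∈ T_n.
This in turn implies that there exists t ∈ T_n such that

$$\mathrm{length}(t) + 2n \dim(T^{-1}(t) \cap B_n) \ge 2n \dim(F_n)/2 = n \dim(F_n),$$

which is as required.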
(Part (b)) Follows from (Part (a)) and Theorem 1. This completes the proof. •

This answers our question at the beginning of this section: If G is a family of high dimension, then G
is not decomposable into component families of low dimension with short descriptions. It is the
understanding of this author that it is widely believed in the Artificial Intelligence community that learning
architectures can be efficiently applied to domains of intractably high dimension [Mitchell 1987]. As we
see from the above, this is not true. Does this mean that learning architectures are not very useful? No,
for three reasons. The first reason is primarily of theoretical interest. Specifically, if NP ≠ RP, there are
families of functions that are polynomial-time learnable in Framework 2, but not polynomial-time learnable
in Framework 1.

Theorem 6: If a family F is of polynomial dimension, then F is polynomial-time learnable in Framework 2.

Proof: For each f_i in F_n, simply choose t_i to be the index of f_i. Since dim(F_n) is polynomial in n, there
exists a basis D_n for F_n such that the indices of D_n are of length polynomial in n. •
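The construction in this proof, where the theory for a function is simply its index, can be sketched as below. The helper names and the three-function family are hypothetical, and the indexing scheme (position in a list, written in binary) is one arbitrary choice of "acceptable indexing".

```python
def index_theory(family):
    """Theory T mapping each function to its index in the family, written in binary."""
    width = max(1, (len(family) - 1).bit_length())
    return {f: format(i, "0{}b".format(width)) for i, f in enumerate(family)}

def make_teacher(target, theory):
    """TEACHER whose first response is the theory for the target; no examples are needed."""
    def teacher():
        return theory[target]
    return teacher

def learn_from_theory(teacher):
    """Learner: the first TEACHER response already names the target by its index."""
    return int(teacher(), 2)

# Hypothetical family of three length-preserving functions.
family = [lambda x: x, lambda x: x[::-1], lambda x: "0" * len(x)]
theory = index_theory(family)
learned = learn_from_theory(make_teacher(family[2], theory))
print(theory[family[2]], learned)  # 10 2
```

The learner's running time is dominated by reading the index, which is why the length of the index (and hence the dimension) governs the cost.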

If NP ≠ RP, then we know that there exist function families that are of polynomial dimension but are
not polynomial-time learnable [Kearns et al. 1987]. Hence we have the following:

Corollary: If NP ≠ RP, then

{F | F is p-time learnable in Framework 1} ⊊ {F | F is p-time learnable in Framework 2}.

The second reason is of practical interest. Let A_1 and A_2 be two polynomial-time learning algorithms
for a family F in Frameworks 1 and 2 respectively. Now, A_1 could run in time as little as n·h·dim(F_n) on
inputs (n, h) [Natarajan 1987b, Theorem 1]. A_2 could run in time n·dim(F_n) on the same input, simply by
choosing the theories to be the indices of the functions as in the proof of Theorem 6. Thus, A_2 could be
faster than A_1 by a factor of h, something that could be of significant practical importance.

Thirdly, in situations where the cost of obtaining an example is, bit for bit, significantly more than the
cost of a comparable amount of background information, it is advantageous to use all the "theory"
available. Again, this is of practical significance.

3.2  An  Open  Problem 

First some notation: we use >_a, =_a, <_a to denote asymptotically greater than, equal to and less than
respectively, as relations on functions.


The last corollary prompts us to ask the following question. Is there a learning hierarchy over the
complexity measure of theory length? Specifically, does there exist an infinite collection of functions
{q_1(n), q_2(n), ..., q_i(n), ...} where q_i >_a q_{i-1}, such that for each q_i there exists a family of functions that is
p-time learnable with q_i long theories but not with q_{i-1} long theories? In an attempt to answer this
question, we consider the following model of computation. Let f: Σ* → Σ* be a function. An algorithm A is
said to compute f with q(n) long theory if


(a) A receives x as input and produces f(x) as output.

(b) A also receives θ(f(x)), where θ: Σ* → Σ* is a function such that |θ(y)| ≤ q(|y|). We call θ the
theory or advice function and θ(f(x)) the advice. We also say that A receives advice of length q.

The intent here is to provide the algorithm A with some short advice on the output string y, short
compared to the length of y. (While this model may appear similar to the model of [Karp and Lipton
1980], it is quite different altogether.) We now ask whether there exists a hierarchy of functions q_1 <_a
q_2 <_a ... such that for each q_i there exists some function f that is p-time computable with q_i advice, but not
with q_{i-1} advice.

Define FSAT to be the following problem.

Input: A boolean formula Φ of n variables.

Output: Any satisfying assignment for Φ.

Clearly FSAT is NP-complete. Using FSAT, we can exhibit a weak hierarchy as follows.

Claim 2: If NP ⊄ DTIME(2^{q(n)}) for some q(n) >_a log n, then for any r(n) such that log n <_a r(n) <_a
q(n), there exists a function that is computable in polynomial time with q^{-1}(r(n)) advice, but not with r(n)
advice. It is assumed that q is a one-one function and q^{-1} is the inverse of q.

Proof (sketch): By assumption, FSAT ∉ DTIME(2^{q(n)}) and hence is not computable with q(n) advice.
But surely, FSAT is computable with n advice. This proves the claim for r(n) = q(n). By a simple padding
argument, this can be generalized to any r(n), log n <_a r(n) <_a q(n), completing the proof. •
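The trade-off between advice length and computation that underlies this sketch can be illustrated on small instances. The code below assumes, purely for illustration, that the advice is a prefix of some satisfying assignment and that the formula is given as a Python callable; with all n bits of advice the search is a single check, while with k bits it examines up to 2^(n-k) completions.

```python
from itertools import product

def fsat_with_advice(phi, n, advice):
    """Search for a satisfying assignment of phi that extends the advice prefix.

    phi maps a tuple of n booleans to a boolean; advice is a tuple of k <= n booleans.
    The loop body runs at most 2 ** (n - k) times.
    """
    k = len(advice)
    for rest in product((False, True), repeat=n - k):
        assignment = tuple(advice) + rest
        if phi(assignment):
            return assignment
    return None

# Hypothetical formula: x0 AND (NOT x1) AND x2.
phi = lambda a: a[0] and (not a[1]) and a[2]

print(fsat_with_advice(phi, 3, (True, False, True)))  # full advice: one check
print(fsat_with_advice(phi, 3, (True,)))              # one bit of advice: up to 4 checks
print(fsat_with_advice(phi, 3, ()))                   # no advice: up to 8 checks
```

All three calls return (True, False, True), the only satisfying assignment of this toy formula.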

We  can  also  exhibit  an  equally  weak  learning  hierarchy  as  follows. 

Claim 3: If NP ⊄ RTIME(2^{q(n)}) for some q(n) >_a log n, then for any r(n) such that log n <_a r(n)
<_a q(n), there exists a family of functions that is polynomial-time learnable with q^{-1}(r(n)) theory, but not with
r(n) theory. (Here, RTIME stands for random-time, and again q is assumed one-one.)

Proof (sketch): Similar to the proof of the previous claim. Hinges on the result of [Kearns et al. 1987]
showing the problem of ordering boolean threshold functions to be NP-complete. •

Unfortunately, the above hierarchies are rather weak, and are based on strong assumptions. While
we do not have stronger results, we feel compelled to point out that this problem might be of interest from
the cryptography viewpoint as well. Specifically, suppose that E were a cryptographically secure
encryption function with an m-bit key. Given polynomially many examples of the form (x, E(x)), and
m − O(m) bits as advice on the key, is it possible to efficiently compute the key of an encryption function
that agrees with the examples?

We  close  this  section  with  a  conjecture. 


Conjecture: If P ≠ NP, FSAT is not polynomial-time computable with q(n) = n − O(n) advice.
4. Learning as Improvement in Computational Efficiency

In this section, we develop a framework to explore learning in the sense of improving computational
efficiency. This is of considerable practical importance [Mitchell 1983].

Define  a  problem  domain  D  to  be  the  pair  (G,  O),  where 

(a) The goal function G: Σ* → {0,1} is a total function from Σ* to {0,1} computable in polynomial time.

(b) O is a finite set of operators {o_1, o_2, ...} where each o_i: Σ* → Σ* is a length-preserving function
computable in polynomial time. The operators need not be total functions.

For the problem of symbolic integration discussed in the introduction, G would simply be the rule that
the expression was free of integral signs and the operator set O would be a table of standard integrals.

The specification of a domain D = (G, O) is a set of programs for G and O that run in polynomial time.
Notation: for any string x we denote the length of x by |x|. We say x ∈ Σ* is solvable if there exists a
sequence σ of operators in O (written σ ∈ O*) of length |x| or less such that G(σ(x)) = 1. σ(x) is a solution of
x and σ is a solution sequence of x. An algorithm for D is a deterministic program that takes as input x ∈
Σ* and computes a solution sequence for x, if such exists.

A meta-domain M is any set of domains such that every domain in M is defined on the same
alphabet. A meta-algorithm for M is an algorithm that takes as input the specification of any domain D ∈
M and computes as output an algorithm for D.

Example: Let Σ = {0, 1, $}. For any boolean function Φ of n variables, let Γ(Φ) denote the following
function from Σ* to {0,1}.

Γ(Φ)(x) = Φ(y) if x = y$, y ∈ (0+1)^n
        = 0 otherwise.

Let o_1, o_2 be functions from Σ* to Σ* given by

o_1(x) = u0$v, x of the form u$av, u, v ∈ (0+1)*, a ∈ (0+1),
       = x otherwise,

o_2(x) = u1$v, x of the form u$av, u, v ∈ (0+1)*, a ∈ (0+1),
       = x otherwise.

Let M be the collection of all domains of the form (G, O) where G = Γ(Φ) for some boolean function Φ and
O is the two operators defined above.
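The example domain is small enough to write out directly. The sketch below is one possible rendering, with hypothetical helper names `make_goal` and `make_operator`; the boolean function is passed as a Python callable on binary strings, which is an assumption about representation rather than anything fixed by the text.

```python
def make_goal(phi, n):
    """Goal function: 1 on strings y$ with y a binary string of length n and phi(y) true."""
    def G(x):
        if len(x) == n + 1 and x.endswith("$") and set(x[:-1]) <= {"0", "1"}:
            return 1 if phi(x[:-1]) else 0
        return 0
    return G

def make_operator(bit):
    """Operator that rewrites u$av to u<bit>$v and leaves every other string unchanged."""
    def o(x):
        i = x.find("$")
        well_formed = x.count("$") == 1 and set(x.replace("$", "")) <= {"0", "1"}
        if well_formed and i + 1 < len(x):
            return x[:i] + bit + "$" + x[i + 2:]
        return x
    return o

o1, o2 = make_operator("0"), make_operator("1")

# Hypothetical boolean function of 3 variables, satisfied only by 101.
G = make_goal(lambda y: y == "101", 3)

x = "$000"
for o in (o2, o1, o2):   # write 1, then 0, then 1
    x = o(x)
print(x, G(x))  # 101$ 1
```

Each operator application moves the `$` marker one position to the right and fixes one bit, so a solution sequence for `$000` spells out a satisfying assignment.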

It is easy to see that constructing an algorithm for an arbitrary domain D ∈ M is equivalent to
deciding the satisfiability of boolean formulae. Hence, if P ≠ NP, M does not have a polynomial-time
meta-algorithm. We break here for a definition. •

An example for a domain D is a pair (x, σ), x ∈ Σ*, σ ∈ O^k, k ≤ |x|, such that G(σ(x)) = 1.

Example (continued): Returning to our example, we see that if the meta-algorithm were allowed to
see a single example for its input domain, its task is trivial. •

The  point  behind  the  example  is  as  follows.  Given  a  domain  D,  it  might  be  computationally 
intractable to compute an efficient algorithm for D, even if we knew that such existed. Yet, seeing solved
examples  for  the  input  domain  allows  an  efficient  algorithm  to  be  constructed  quickly.  The  examples 
serve  to  improve  the  computational  efficiency  of  the  meta-algorithm,  and  hence  we  view  this  as  learning 
in  the  sense  of  improving  efficiency  as  opposed  to  concept  learning. 

To  furnish  the  meta-algorithm  with  examples,  we  place  at  its  disposal  a  routine  EXAMPLE,  similar  to 
the  one  of  Framework  1.  At  each  call,  EXAMPLE  returns  a  randomly  chosen  example  for  the  input 
domain. 

We  say  a  meta-domain  M  allows  heuristics  if  there  exists  a  meta-algorithm  for  M  such  that 

(a) A takes as input integers n, h and the specification of a domain D ∈ M. Let t be the least upper
bound of the running time on inputs of length n of the programs in the specification of D.

(b) A computes for time polynomial in n, h, the length of its input, and t. A may call EXAMPLE, which
returns examples for D, chosen according to some unknown distribution P over the solvable
subset of Σ^n.

(c) For all D ∈ M and all distributions P over Σ^n, with probability (1 - 1/h) A outputs a program H_n that
approximates an algorithm for D in the sense that

$$\sum_{x \in S} P(x) \le 1/h$$

where

$$S = \{x : |x| = n,\ H_n \text{ is incorrect on } x\}.$$

(d) For any two inputs (l, h_1, D) and (m, h_2, D), l > m, let A output H_l and H_m respectively. Then

$$\frac{H_l\text{'s run time on } \Sigma^l}{H_m\text{'s run time on } \Sigma^m} \le (l/m)^k$$

where k is a constant that depends only on D.

Conditions  (a)  through  (c)  in  the  definition  above  are  as  in  Framework  1,  and  have  the  same 
purpose.  Condition  (d)  is  a  uniformity  condition  requiring  that  the  run  time  of  the  algorithm  output  by  A 
grows polynomially with the length of the strings it is useful on.
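To make the uniformity condition concrete, write T(l) for the worst-case run time, over inputs of length l, of the program that A outputs for length l (T is shorthand introduced here for illustration, not the paper's notation). Assuming condition (d) is read as a bound of (l/m)^k on the ratio of run times, fixing m = 1 gives, for every l > 1,

\[
\frac{T(l)}{T(1)} < \left(\frac{l}{1}\right)^{k}
\quad\Longrightarrow\quad
T(l) < T(1)\, l^{k},
\]

so the run time of the output program is bounded by a polynomial of degree k in the length of the strings it is useful on, with k depending only on D.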

Let D=(G,O) be a domain. For each operator o ∈ O and integer i ≥ 1, consider the set

U_i(o) = {x | o(x) has a solution sequence of length |x|−i or less}.

We call the U_i(o) the preimages of o in D, and we call the collection of preimages for all the operators in D the preimages of D.

Claim  4:  For  any  domain  D,  given  efficient  programs  to  test  membership  in  the  preimages  of  D,  we 
can  construct  an  efficient  algorithm  for  D. 


Proof: Consider the following algorithm:

input x, |x| = n
begin
    σ ← null-sequence;
    for i = 1 to n do
        pick o ∈ O such that x ∈ U_i(o);
        if no such exists, fail;
        x ← o(x);
        σ ← σo;
    od
    output σ, a solution sequence for x.
end


Clearly  this  is  an  algorithm  for  D.  • 
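The construction in Claim 4 can be sketched in Python on a toy domain. Everything specific here is an assumption made for illustration: the domain (binary strings, with the all-zero strings as goals), the two operators `set_last_zero` and `clear_first_one`, and the closed-form membership test `in_preimage`, which relies on the fact that in this toy domain the shortest solution of a string uses one `clear` per '1'. The paper itself only assumes that efficient membership tests for the preimages are given.

```python
from itertools import product

def clear_first_one(x):
    """Toy operator: turn the leftmost '1' into '0' (identity if there is none)."""
    i = x.find("1")
    return x if i < 0 else x[:i] + "0" + x[i + 1:]

def set_last_zero(x):
    """Toy operator: turn the rightmost '0' into '1' (identity if there is none)."""
    i = x.rfind("0")
    return x if i < 0 else x[:i] + "1" + x[i + 1:]

# Hypothetical domain: goals are the all-zero strings, O = {set, clear}.
OPERATORS = [("set", set_last_zero), ("clear", clear_first_one)]

def is_goal(x):
    return "1" not in x

def in_preimage(op, i, x):
    """Membership test for U_i(o): o(x) is solvable in at most |x| - i steps."""
    return op(x).count("1") <= len(x) - i

def solve(x):
    """Claim 4 loop: at step i apply any operator o with x in U_i(o)."""
    n, sigma = len(x), []
    for i in range(1, n + 1):
        choice = next(((name, op) for name, op in OPERATORS
                       if in_preimage(op, i, x)), None)
        if choice is None:
            return None  # corresponds to "fail" in the pseudocode
        name, op = choice
        x = op(x)
        sigma.append(name)
    return sigma

def apply_sequence(x, sigma):
    """Apply the named operators in order and return the resulting string."""
    ops = dict(OPERATORS)
    for name in sigma:
        x = ops[name](x)
    return x

def all_strings(max_len):
    """All binary strings of length 1..max_len, used to check the sketch."""
    return ["".join(bits) for n in range(1, max_len + 1)
            for bits in product("01", repeat=n)]
```

Note that `solve` may pick a seemingly unhelpful operator (here `set`) whenever the remaining step budget still allows a solution; membership in U_i(o) is the only thing that has to hold at each step.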

We need one more definition before we can state the second of our main results. Let F be a family of functions from Σ* to {0,1} for some alphabet Σ. With any f ∈ F we associate the set S_f = {x | f(x) = 1}. Any string x ∈ Σ* is a positive example for f ∈ F if x ∈ S_f. We say that F is well-ordered if for any set S of strings in Σ* such that S ⊆ S_f for some f ∈ F, there exists a least g ∈ F such that S ⊆ S_g, i.e., for all g' ∈ F, S ⊆ S_g' implies that S_g ⊆ S_g'. An ordering for a well-ordered family is similar to an ordering for general families as defined in section 1, except that it takes as input a set of positive examples and outputs the least function consistent with these examples as defined above. For more details on well-ordered families, see [Natarajan 1987a].
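A small example may help fix the definition. Take the hypothetical family of "prefix" functions f_p, where f_p(x) = 1 exactly when the string p is a prefix of x; this family is chosen here for illustration and is not discussed in the paper. For any set S of positive examples, the least consistent function is the one indexed by the longest common prefix of S, so an ordering for this family can be sketched as follows.

```python
import os

def ordering(positive_examples):
    """Return (p, g): g is the least prefix-function consistent with the examples.

    The family {f_p} is an illustrative assumption; g accepts exactly the
    strings that start with the longest common prefix p of the examples.
    """
    # os.path.commonprefix compares strings character by character.
    p = os.path.commonprefix(list(positive_examples))

    def g(x):
        return 1 if x.startswith(p) else 0

    return p, g

p, g = ordering(["0110", "0101", "011"])
```

Any other prefix function consistent with the same examples is indexed by a shorter prefix of p and therefore accepts a superset of the strings accepted by g, which is the "least" property required of a well-ordered family.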

Theorem 7: Let M be a meta-domain on an alphabet Σ. If there exists a family of functions F from Σ* → {0,1} such that

(a) F contains the preimages of every domain in M,

(b) There exists a polynomial p(n) such that every function in F is computed by some program that runs in time p(n) on inputs of length n,

(c) F is of polynomial-dimension, well-ordered and polynomial time orderable by an algorithm that outputs the p(n)-time bounded programs of (b),

then M allows heuristics.

Proof: (sketch) Let F be a family as above and let A be an ordering for it as in (c) above. We use A to construct a meta-algorithm A' for M as shown below. Essentially, the algorithm uses A to construct good approximations for the preimages of D and then uses these preimages to build an algorithm for D as in Claim 4.

Meta-Algorithm A'
input n, h, D=(G,O)
begin
    for i = 1 to n do
        for each o ∈ O do
            Let F_n be of dimension d.
            m ← n(nh|O|)d;
            S ← ∅;
            for j = 1 to m do
                Call EXAMPLE to obtain (x, σ_x);
                for each decomposition of σ_x
                into σ_1 o σ_2, |σ_2| ≤ |x|−i do
                    S ← S ∪ {σ_1(x)};
                od
            od
            U_i(o) ← A(S);
        od
    od
    output the following as the algorithm for D:
        input x, |x| = n
        begin
            σ ← null-sequence;
            for i = 1 to n do
                pick o ∈ O such that x ∈ U_i(o);
                if no such exists, fail;
                x ← o(x);
                σ ← σo;
            od
            output σ, a solution sequence for x.
        end
end

In the interest of brevity, we skip a formal proof that A' is a meta-algorithm for M. •
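The sample-collection step of the meta-algorithm can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the two toy operators, the operator names, and the sample data are hypothetical, and the condition on the tail of the solution sequence is read as "the part of the sequence after the chosen occurrence of o has length at most |x| − i". For each solved example, every prefix of the solution that is immediately followed by o contributes one positive example for the preimage U_i(o).

```python
def clear_first_one(x):
    """Toy operator: turn the leftmost '1' into '0' (identity if there is none)."""
    i = x.find("1")
    return x if i < 0 else x[:i] + "0" + x[i + 1:]

def set_last_zero(x):
    """Toy operator: turn the rightmost '0' into '1' (identity if there is none)."""
    i = x.rfind("0")
    return x if i < 0 else x[:i] + "1" + x[i + 1:]

OPERATORS = {"clear": clear_first_one, "set": set_last_zero}

def positive_examples(samples, o_name, i):
    """Collect the set S of positive examples for U_i(o) from solved examples.

    samples -- iterable of (x, sigma_x), sigma_x a list of operator names
    """
    S = set()
    for x, sigma_x in samples:
        y = x  # y tracks sigma_1(x) as the prefix sigma_1 grows
        for pos, name in enumerate(sigma_x):
            tail_len = len(sigma_x) - pos - 1  # |sigma_2|
            if name == o_name and tail_len <= len(x) - i:
                S.add(y)
            y = OPERATORS[name](y)
    return S

# Hypothetical solved examples, as EXAMPLE might return them.
samples = [("110", ["clear", "clear", "clear"]),
           ("011", ["set", "clear", "clear", "clear"])]
```

In the meta-algorithm, each such set S would then be handed to the ordering A, whose output serves as the approximate membership test for U_i(o).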

Essentially, Theorem 7 reduces the task of learning in this framework to one of learning boolean-valued functions in Framework 1, and then invokes the dimensionality theorem for Framework 1. The reader should not jump to the conclusion that the role of the examples here is therefore the same as that in Framework 1. Even in the absence of examples, the specification of the domain gives the learner sufficient information to construct an algorithm for the domain. The examples serve only to speed up this computation and add no new information. Hence it is not possible here to make a distinction analogous to the distinction between polynomial-time learnability and learnability of Framework 1, a distinction that separated the information complexity of concept learning from the computational complexity. It follows that tightening Theorem 7 to an "only-if" will have to wait until the "only-if" counterpart to Theorem 4 is proved, which in turn waits for a better understanding of the relationship between NP and RP.

5.  Conclusion 

This paper introduced two new frameworks for learning.

The first framework concerned learning functions or concepts, allowing the learner to see examples for the function to be learned as well as some background information or "theory" on it. We showed that the class of function families learnable in this framework (i.e., from few examples and short theories) is exactly the class learnable in the more established framework of [Valiant 1984] (i.e., from few examples and no theories). We believe that this result will better motivate those in the Artificial Intelligence community concerned with building "learning architectures." The proof of the aforementioned result directly relates the length of a piece of information with how useful it is to the learner. Although the relationship is remarkably simple, it required the formalization of a learning framework to permit its interpretation in the context of learning from background information and examples.

The second framework concerned learning in the sense of improving computational efficiency. The framework has sufficient structure to allow a crisp analysis, yet is rich enough to capture the flavour of many practical problems. We proved a theorem identifying some conditions sufficient to allow a learning algorithm within the framework. We believe that this framework and the associated theorem are of significant practical import.